Abstract

Region-based compilation repartitions a program into more desirable compilation units using profiling
information and procedure inlining to enable region formation analysis. Heuristics play a key role in
determining when it is most beneficial to inline procedures during region formation. An ILP optimizing
compiler using a region-based approach restructures a program to better reflect dynamic behavior and
increase interprocedural optimization and scheduling opportunities. This paper presents an interprocedural
compilation technique which performs procedure inlining on-demand, rather than as a separate
phase, to improve the ability of a region-based optimizer to control code growth, compilation time and
memory usage while improving performance. The interprocedural region formation algorithm utilizes
a demand-driven, heuristics-guided approach to inlining, restructuring an input program into interprocedural
regions. Experimental results are presented to demonstrate the impact of the algorithm and
several inlining heuristics upon a number of traditional and novel compilation characteristics within a
region-based ILP compiler and simulator.



1 Introduction 



Advanced instruction-level parallel (ILP) computer architectures require aggressive and potentially costly
whole program, or interprocedural, techniques for program analysis and optimization to fully exploit available
parallelism. These interprocedural techniques are in contrast to intraprocedural code improvement
techniques employed in a traditional procedure-oriented compiler, where analysis and optimization phases
are independently applied to each procedure in isolation.

An approach for ILP that reduces the cost of aggressive interprocedural analysis and optimization is
region-based compilation [20]. Region-based compilation is a generalized trace selection approach that partitions
a program into units of compilation, or regions, based on profile information. Using procedure inlining,
where a procedure callsite is replaced by the body of the called procedure, and restructuring a program
into regions, the region-based compiler can perform code motion and other analyses and optimizations
interprocedurally, while maintaining control over the compilation unit size and content. Unlike procedure-based
compilation, region-based techniques bound the compilation unit size to better control optimization costs [20].

The key component of a region-based compiler is the region formation phase, which partitions the program
into regions using profile-guided heuristics with the intent that the ILP optimizer will be invoked with a scope
that is limited to a single region at a time. Thus, the quality of the generated code depends greatly upon
the ability of the region formation phase to create regions that a global optimizer can effectively transform
in isolation for improved ILP. Because region-based compilation relies on an initial aggressive inlining phase,
region formation remains quite costly, particularly for large programs with many procedures and calls [20].
Selective use of inlining can prevent excessive code growth and control register pressure while improving
analysis opportunities and performance [7].

In this paper, a strategy to overcome the issues caused by separate inlining and region formation phases 
is described and evaluated. Presented is a demand-driven approach to inlining and a set of inlining heuris- 
tics which are integrated within a region-based optimizer. To evaluate these techniques, the algorithm and 
various heuristics for guiding inlining decisions have been implemented within the Trimaran ILP research 
compiler [28]. In addition to standard metrics such as compilation time, code growth and execution time,
novel metrics have been devised to compare the characteristics of regions, such as profile homogeneity and 
interprocedural scope, to measure the effectiveness of this new approach. 

2 Region-based Compilation 

A common characteristic of compiler analysis techniques, including those specifically for ILP architectures, 
is that they have been designed with the assumption that the original procedure boundaries created by the 
programmer are immutable. Procedures serve as the de facto unit of compilation. As a result, there is the 
potential for large procedures to either unacceptably increase compilation time or to be less aggressively 
optimized (or not optimized at all) in order to control compilation costs and maintain scalability. Procedure 
boundaries are a natural impediment to compilation effectiveness in many cases, requiring tradeoffs in terms 
of quality of optimization versus compilation time and memory requirements. 

Hank et al. [20] proposed the region-based compilation framework as a solution to the problem of exposing
interprocedural scheduling and optimization opportunities without the cost of very large procedure bodies 
created through inlining, or the expense and complexity of sophisticated interprocedural analysis and code 
motion. While it was shown to be especially beneficial in an ILP compiler, region-based compilation also can 
achieve both interprocedural scope and scalability in program analysis. 

2.1 Fundamental region formation 

Figure 1 depicts the organization of a region-based compiler framework. The source code enters the Profiler,
where the source code is instrumented and executed to gather profile information which is then integrated 
into the source code. Intermediate code with profiling information is input to the Aggressive Inliner phase, 
where all inlining that can be done in the entire program, subject to some constraints, is performed. Next, 
in the Region Formation phase, regions are formed throughout the whole program, and each region is 
encapsulated as a procedure in the Encapsulation phase. The encapsulated regions are then passed to a 
high-level Optimizer phase before Reintegration into their original procedures. The result is passed to the 
Code Generator which includes a low-level optimization phase. 

In this framework, a region is a collection of basic blocks and control flow edges selected for compilation 
as a unit [20]. More formally, a region is a subgraph of the control flow graph (CFG) of a procedure, created
either based on the structure of the CFG or using profile information. Each region is encapsulated in a 
single-entry, single-exit CFG by adding dummy prologue and epilogue CFG nodes and boundary condition 
CFG nodes that convey pertinent data flow information. Regions are encapsulated in such a way that the 
optimizer can be invoked with a scope that is limited to a given region, which then appears to the rest of the 
compiler as a procedure. Side entries into regions can be removed by tail duplication, similar to superblock 
formation [22]. After optimization, each region is reintegrated into the original procedure in which the region
existed by updating changes in data flow conditions, entry and exit points, and constraints on register 
allocation. Code is generated from the reintegrated procedure. 

2.2 Example 

The original profile-sensitive region formation algorithm comprises the following steps, performed
between aggressive inlining and region encapsulation. These steps are repeated until all blocks in the program
have been included in some region. Figure 2 shows the results of performing the following steps of the
algorithm. Figure 2(b) shows the code after aggressive inlining is performed on the code in Figure 2(a).

Figure 1: Original region-based compilation framework [20].



Step 1: Seed Selection - From among all basic blocks in the procedure not yet included in a region, select 
the block with the highest execution frequency; this is the seed block for a new region. In this simplified 
example, this is block 8, shown in Figure 2(b). Note that inlining was done previously.
Step 2: Region Expansion to Successors - A path of desirable successors is selected, starting at the 
seed block. Region expansion is guided by heuristics which halt the growth under a set of conditions such 
as [20]: (1) a procedure call is reached, (2) a minimum acceptable execution frequency for a successor block
is not met (e.g., at least 50% of the frequency of both its immediate predecessor in the region and that of 
the seed block, which in this simplified example is why block 6 is not selected in this step), or (3) a region 
size threshold (e.g., 200 basic blocks) is exceeded. The successors selected for seed block 8 are blocks 10, 11, 
5 and 7. 

Step 3: Region Expansion to Predecessors - A path of frequently executed predecessors to the seed 
block is chosen analogous to the selection of desirable successors. The resulting path after this step is the 
seed path of the region. In this case, blocks 2 and then 1 are added as predecessors of seed block 8. 
Step 4: Region Expansion from All Blocks in the Seed Path - By selecting as above the desirable 
successors of all current blocks in the region, the region is grown along multiple control flow paths. Thus, 
block 3 is added to the region. The result of this step is a path-sensitive region. Blocks not yet in a region 
(blocks 6 and 9) are used to form additional regions. 
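
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [20] or Trimaran: the function name `form_regions`, its parameters, and the small CFG at the bottom are all hypothetical, the 50% threshold and 200-block limit are the example values quoted in Step 2, and tail duplication is omitted.

```python
from collections import defaultdict

def form_regions(weights, succs, call_blocks=frozenset(), threshold=0.5, max_size=200):
    """Partition CFG blocks into regions by seed selection and guided growth."""
    preds = defaultdict(list)
    for block, targets in succs.items():
        for target in targets:
            preds[target].append(block)

    unassigned = set(weights)
    regions = []

    def desirable(cand, neighbor, seed, region):
        # Growth halts at procedure calls, at cold blocks, and at the size limit.
        return (cand in unassigned
                and cand not in call_blocks
                and len(region) < max_size
                and weights[cand] >= threshold * weights[neighbor]
                and weights[cand] >= threshold * weights[seed])

    def grow_path(seed, region, edges):
        # Follow the most frequent neighbor for as long as it is desirable.
        x = seed
        while edges.get(x):
            y = max(edges[x], key=weights.get)
            if not desirable(y, x, seed, region):
                break
            region.append(y)
            unassigned.discard(y)
            x = y

    while unassigned:
        # Step 1: the hottest unassigned block seeds a new region.
        seed = max(sorted(unassigned), key=weights.get)
        unassigned.discard(seed)
        region = [seed]
        grow_path(seed, region, succs)   # Step 2: successors of the seed
        grow_path(seed, region, preds)   # Step 3: predecessors of the seed
        # Step 4: desirable successors of every block already in the region
        stack = list(region)
        while stack:
            x = stack.pop()
            for y in succs.get(x, []):
                if desirable(y, x, seed, region):
                    region.append(y)
                    unassigned.discard(y)
                    stack.append(y)
        regions.append(sorted(region))
    return regions

# Hypothetical CFG: a diamond whose right arm (block 3) is rarely executed.
weights = {1: 100, 2: 90, 3: 10, 4: 100, 5: 100}
succs = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
print(form_regions(weights, succs))  # [[1, 2, 4, 5], [3]]
```

In this toy CFG the cold block 3 fails the frequency test and ends up in a single-block region of its own, much as blocks 6 and 9 do in the example above.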

To summarize, three regions are formed in the example. The largest region consists of blocks 1, 2, 3, 
5, 7, 8, 10, and 11. The remaining blocks 6 and 9 form single block regions. Note that original block 4 
was replaced by the inlined procedure G. Limitations include the potential for excessive code growth and 
unnecessary inlining due to the aggressive approach to inlining, leading to unscalability, and the training-data 
effect of profile-guided compilation. While Hank's approach can achieve scalability during program analysis 
and optimization by allowing the compiler to control the size of regions, region formation is unscalable due 
to aggressive inlining. 



3 Region Formation Analysis with Demand-driven Inlining 

Interprocedural regions that include instructions from more than one procedure enable region-based compilation
to uncover optimizations missed due to procedure boundaries [20]. This section presents an alternative
approach to building interprocedural regions, which performs inlining on a demand-driven basis integrated
within region formation analysis. By delaying inlining decisions until region
formation analysis, the characteristics of inlined code can be better controlled, reducing code growth and
memory requirements. However, inlining performed in this demand-driven way introduces a number of issues
that are not present in existing region formation techniques; these issues are enumerated, and a technique is
proposed to address them. In the remainder of this paper, the approach of aggressive inlining followed by
intraprocedural profile-sensitive region formation (i.e., Hank et al.) is referred to as Phased_region, and the
new demand-driven approach is called Demand_region.

Figure 2: Example of the steps in Hank's region formation algorithm.



3.1 Challenges in Forming Interprocedural Regions 

Major issues to consider in the design of Demand_region are:

Issue 1. Inlining is driven by the demand placed at procedure callsites as regions are formed. 

Callsites may be encountered as a most frequent successor or predecessor of a block on a path within the 
current region being formed. The path selection process must determine at that point whether or not the 
callee should be inlined, a decision dependent on the heuristics used to guide inlining. If the decision is 
made to inline a procedure, it is inlined and region formation proceeds within the callee's code. Thus, 
interprocedural regions are identified by having the region formation process cross procedure boundaries by 
inlining on demand. 

Issue 2. Region formation analysis must deal with multiple calls to the same procedure as it
crosses procedure boundaries. While region formation on the flattened, aggressively inlined code of
Phased_region analyzes a distinct code segment for each callsite that has been inlined, region formation without
prior inlining analyzes the same code for a procedure's body for each callsite to that procedure. Depending on
the context, a callee could be partitioned into different regions for different callsites. Demand_region should
either maintain separate information about a procedure for each inlinable callsite to that procedure, or partition
the procedure the same way each time.

Issue 3. The ordering of procedures analyzed for region formation and inlining impacts
compilation overhead. Performing demand-driven inlining can lead to large compilation and runtime memory
requirements similar to Phased_region if the order in which inlining and region formation are performed is not
carefully considered. As a callsite is encountered in Demand_region, the region formation algorithm begins
to form regions in the callee. Thus, the amount of code growth and the size of data structures needed during
region formation are dependent on the handling of the worklist of blocks for partitioning as region formation
crosses procedure boundaries.

Issue 4. Procedures may not be inlined at every callsite. While a procedure's code is partitioned into
regions on demand at callsites, at some of those callsites the decision may be made not to inline, resulting
in the procedure being partitioned into local regions in isolation from any calling context. Thus, a record of the
inlining of each procedure should be maintained to identify procedures that need to be processed in isolation
during region formation.

Figure 3: Illustration of region classification for individual procedures.

Issue 5. Total code growth is an imprecise limiting metric in Demand_region since each region 
will be analyzed and optimized separately. A limit on the memory requirements for Phased_region 
is achieved by restricting how large the program can grow in total size during the aggressive inlining pass; 
however, individual procedures may be able to grow very large. This is problematic, since memory require- 
ments during analysis are proportional to the size of the largest procedure. Demand-driven inlining can also 
ensure that individual procedures do not grow excessively large by making use of heuristics that consider 
the impact of inlining before it is performed. 

Issue 6. Region formation may be partially completed in multiple procedures simultaneously.
With Demand_region, region formation proceeds recursively. Region formation starts in a procedure, and
when a callsite is reached it may continue recursively into the callee, temporarily suspending region formation
in the caller. Thus, region formation is partially completed in the calling procedure and will only complete
after region formation is completed in the callee. When additional levels of recursive region formation occur,
regions will be in various stages of completion along the entire call chain, completing as each callee invocation
returns.



3.2 A Classification of Regions of a Procedure 

The interprocedural region formation algorithm addresses each of the described issues, based on a classification
of regions in a single procedure. Regions are classified with respect to individual procedures and the callsites
where they are invoked. Figure 3, which contains control flow graphs for three procedures and the formed
regions in different shadings, illustrates each of the different classifications of regions. A region in f that
includes either the entry or exit block of f is an interprocedural region. An interprocedural region can be
either entry, exit, or pass-through. For each procedure f, each callsite c with a call to f has a single entry
region associated with f, entry_{f,c}, which is the region that contains the entry block of f. At the one callsite
in A to procedure B in the figure, the entry region associated with B contains not only the entry block in
B but a path that passes through to the exit of B, and so contains the exit of B as well. At the callsite in B to
procedure C, the entry region associated with C contains the entry block in C and only two other blocks in
C.

Similarly, each callsite c to procedure f has a single exit region, exit_{f,c}. As is the case for the one callsite
in A to B, entry_{f,c} and exit_{f,c} could in fact be the same region because the region follows a path that passes
through from entry to exit; in this case, it can be said that this region is an interprocedural pass-through
region of f at callsite c. All remaining regions containing blocks in f are local regions, or local_{f,c}, as they
do not involve blocks from the caller of f. Note that f may not be partitioned into the same regions for
every callsite to f, since region formation within f is based on the context surrounding the callsite to f.

Figure 4: Demand-driven region-based compilation framework.
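
The classification can be stated compactly in code. The Python sketch below is only an illustration: the function name `classify_region` and the block labels are hypothetical, and a region is modelled simply as a set of block labels belonging to one procedure at one callsite.

```python
def classify_region(region, entry_block, exit_block):
    """Classify one region of a procedure for a given callsite context."""
    has_entry = entry_block in region
    has_exit = exit_block in region
    if has_entry and has_exit:
        return "pass-through"   # the entry and exit regions coincide
    if has_entry:
        return "entry"
    if has_exit:
        return "exit"
    return "local"              # involves no block from the caller

# Hypothetical procedure B: its entry region runs through to the exit.
regions_b = [{"b1", "b2", "b4"}, {"b3"}]
print([classify_region(r, "b1", "b4") for r in regions_b])
# ['pass-through', 'local']

# Hypothetical procedure C: the entry region holds the entry and two other blocks.
regions_c = [{"c1", "c2", "c3"}, {"c4"}, {"c5"}]
print([classify_region(r, "c1", "c5") for r in regions_c])
# ['entry', 'local', 'exit']
```

Because the partition of a procedure may differ from callsite to callsite, such a classification would have to be recomputed, or recorded, per callsite.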



3.3 An Algorithm for Region Formation with Demand-driven Inlining

Figure 4 presents the organization of the Demand_region framework, and Figure 5 presents the region
formation algorithm [88]. Demand_region extends Phased_region in several important ways in order to form
interprocedural regions without aggressive inlining. First, when a callsite is encountered as a region is being
grown, FormRegions recursively calls itself to continue to grow the current region in the callee in the context
of the caller, but without inlining at that time. Second, in order to minimize the size of the data structures
maintained at any given time during region formation, all regions within a called procedure will be identified
before FormRegions returns to region formation in the caller. Third, to enable formation of interprocedural
regions through this recursive approach, FormRegions operates on regions rather than just basic blocks.

FormRegions begins with a worklist B of all blocks in the current procedure f for which it is forming
regions. Successor and predecessor blocks are added to the current region only if they are desirable as defined
in Section 2.2; the predicate Desirable(x,y) plays this role. Non-callsite blocks are appended to the region as in
Phased_region. When a callsite c is reached in the analyzed code, the recursive call to FormRegions forms regions
local to the callee, say g, and then FormRegions returns with the entry and exit regions of g.

If there is no pass-through region of g, entry_{g,c} is concatenated with the region R currently being
formed in f when the callsite was encountered (which completes that interprocedural region), and this merged
region is added to the local Rlist of completed regions in f. Next, a new region R is begun, consisting solely of
exit_{g,c}. If there is a pass-through region for g, this pass-through region is added to R, but R is not necessarily
complete at this point. Region formation continues in f by adding blocks to R. Once all blocks on procedure
f's worklist B are exhausted, the return parameters entryR and exitR are assigned the regions in f that
contain the entry and exit blocks, respectively. The local regions with respect to f (all regions except the
entry and exit regions of f) are optimized and code is generated for them, prior to returning the entry and
exit regions.
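
The merging rule applied when the recursive call returns can be isolated in a short Python sketch. It is an illustration only: `merge_at_callsite`, its parameters and the block labels are hypothetical names, regions are modelled as plain sets of block labels, and a pass-through region is recognized by the entry and exit regions being equal.

```python
def merge_at_callsite(current, entry_region, exit_region, completed):
    """Return the region that keeps growing in the caller after a callsite."""
    if entry_region == exit_region:
        # Pass-through: the callee region is absorbed and growth continues.
        return current | entry_region
    # Otherwise the callee's entry region completes the current region ...
    completed.append(current | entry_region)
    # ... and a new region is begun from the callee's exit region.
    return set(exit_region)

# Pass-through case: nothing is completed yet, and R keeps growing.
done = []
r = merge_at_callsite({"f1", "f2"}, {"g1", "g2"}, {"g1", "g2"}, done)
print(sorted(r), done)  # ['f1', 'f2', 'g1', 'g2'] []

# Distinct entry and exit regions: one region is completed, a new one starts.
done = []
r = merge_at_callsite({"f1", "f2"}, {"g1"}, {"g3"}, done)
print(sorted(r), [sorted(x) for x in done])  # ['g3'] [['f1', 'f2', 'g1']]
```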

The main steps of FormRegions are illustrated for a single callsite by the interprocedural CFGs in Figure 6.
For clarity, the same fill patterns are used to differentiate the steps of the Demand_region algorithm in this
figure as were used to describe the Phased_region algorithm in Figure 2. In this example, a pass-through
region of G exists, is returned to F by FormRegions as both entryR and exitR, and is appended to the
currently forming region R. Procedures that are not inlined at every callsite, not inlined at all, or are potential
procedure aliases, are identified after the region formation that began with the main program is complete.
The parameter to FormRegions named isolated is set for these isolated procedures to indicate that only
local regions are to be formed.

procedure FormRegions(f, isolated, entryR, exitR) {
    B = all blocks in proc f
    Rlist = ∅
    while (blocks remain in B) {
        R = Seed(B)
        seed = last block in R

        // Add successors to the region
        x = seed
        y = most frequent successor of x
        while (y ∉ R && Desirable(x,y)) {
            if (y is proc call && y is inlinable) {
                FormRegions(callee(y), 0, entryR, exitR)
                if (entryR ≠ exitR) {
                    R = R ∪ entryR
                    Rlist = Rlist ∪ R
                    R = ∅
                }
                S = exitR
            }
            else
                S = {y}
            R = R ∪ S
            x = y
            B = B - {y}
            y = most frequent successor of x
        }

        // Add predecessors to region, analogous to adding
        // successors - code omitted for space limitations

        // Add desirable successors to seed path
        stack = R
        while (stack ≠ ∅) {
            x = Pop(stack)
            foreach successor of x, y ∈ B {
                if (Desirable(x,y)) {
                    if (y is proc call && y is inlinable) {
                        FormRegions(callee(y), 0, entryR, exitR)
                        if (entryR ≠ exitR) {
                            R = R ∪ entryR
                            Rlist = Rlist ∪ R
                            R = ∅
                        }
                        S = exitR
                    }
                    else {
                        S = {y}
                        Push(stack,y)
                        B = B - {y}
                    }
                    R = R ∪ S
                }
            }
        }

        // Copy tail & add region to Rlist
        B = B ∪ TailDuplication(R)
        Rlist = Rlist ∪ R
    }

    // Remove entry & exit regions from list
    // generate code for regions local to f
    entryR = region in Rlist with entry of f
    exitR = region in Rlist with exit of f
    if (not isolated)
        Rlist = Rlist - (entryR ∪ exitR)
    CodeGen(Rlist)
}

procedure Seed(B) {
    s = block with maximum weight in B
    B = B - {s}
    if (s is proc call) {
        FormRegions(callee(s), 0, entryR, exitR)
        if (entryR ≠ exitR) {
            R = R ∪ entryR
            Rlist = Rlist ∪ R
        }
        S = exitR
    }
    else
        S = {s}
    return S
}

procedure CodeGen(Rlist) {
    foreach region R ∈ Rlist
        optimize R
    generate code for Rlist
}

Main() {
    FormRegions(main, 1, entryR, exitR)
    foreach proc f ≠ main
        if (not all callsites to f were inlined)
            FormRegions(f, 1, entryR, exitR)
}

Figure 5: Interprocedural algorithm for region formation with demand-driven inlining.

(a) Seed (block 2) is selected as it is the most frequently executed block in proc F. Successors (block 4) are
selected until a callsite is reached.

(b) Region formation is performed recursively in callee G, where local regions are formed. Blocks 8, 10 & 11
form one region, with block 9 as a local region. Region formation then continues in F.

(c) Successor path is completed (5 & 7), predecessors are added (1), desirable successors are added (3). Local
regions are formed from remaining blocks (6). Region formation is complete.

Figure 6: Example of Demand_region.




3.4 Empirical Evaluation 

An experimental comparison of the two region formation approaches, Demand_region and Phased_region, is 
described in terms of compilation memory requirements, code growth and runtime performance. Analysis of 
the characteristics of the resulting units of compilation, including the size, homogeneity of profile weights, 
and code size is performed to explain the results. 

3.4.1 Methodology 

These experiments were conducted using the Trimaran compiler system [28]. With Phased_region as an existing
component, Trimaran was a natural choice for this research. Significant implementation was performed
to add the capability of demand-driven inlining, and to create a region formation module that incorporates
demand-driven inlining and optimization. Also added was the ability to annotate each basic block with its
procedure of origin to enable identification of code that was inlined. For this set of experiments, ten C
benchmarks were used from SPEC 92 and 95 (www.spec.org) representing a variety of computations, code
sizes and program characteristics. Table 1 includes numbers of source code lines and procedure definitions.
The benchmarks were compiled under three scenarios: (1) procedure-based compilation without any
inlining or region formation, (2) region-based compilation using the Phased_region approach, and (3) region-based
compilation using the Demand_region approach.



3.4.2 Results 

Compilation memory requirements 

Table 1 compares the compilation memory requirements for Phased_region versus Demand_region. Due to
design considerations of the Trimaran framework, direct measurement of memory requirements was not
possible. Instead, measurements of whole program size, procedure sizes, and static call chain lengths were
taken, and estimates of memory requirements were computed according to each strategy for region-based
compilation.

For Phased_region, the compilation memory requirements are computed as code size after aggressive
inlining is performed, as measured in number of Lcode instructions, because the entire program may be
held in memory during region formation and optimization (in the worst case). For Demand_region, first the
average and maximum sizes of procedures in a benchmark were calculated. Next, the lengths of static acyclic
call chains were measured at the source code level. The call chain length and procedure size information
were then used to compute the average and maximum of the sum of procedure sizes along the average and
maximum length call chains. The average value provides a good estimate of typical compilation memory
usage for purposes of comparison, while the maximum value indicates the worst case.

Table 1: Comparison of memory requirements during region formation, measured in Trimaran Lcode instructions.

                           General Phased_region     Demand_region
                Lines of C Num. of   Memory req. Static call chain  Procedure size     Memory req.
Benchmark      source code  procs.         Total     Avg.     Max.    Avg.    Max.    Avg.   Worst
008.espresso         14850     361         73997        5       11     183    2059    3186    5175
023.eqntott           3628      62         11738        3        7     230    1757    1156    2538
026.compress          1503      16          2601        2        5     224    1761    1270    1800
099.go               29246     383        110842        9       23     117    1109    1076    3085
124.m88ksim          19092     252         55783        6       11     193    1537    1195    1923
126.gcc             205627    1170       1050754        5       13     202    1810    2666    4391
130.li                7597     357         31552       22       35     112     987    1640    3197
132.ijpeg            29259     473        112188        8       14     124    2510    1385    2185
134.perl             27044     316        100063        5       15     140    1977    1498    2732
147.vortex           67202    1127        302409        4       12     131    2301    1166    2210
average              40505     452         17363        6       11     162    1397    1274    2228

Table 2: Percentage difference in average and maximum memory requirements of Phased_region and Demand_region.

               Demand_region / Phased_region (%)
Benchmark        average   maximum
008.espresso         4.3       7.0
023.eqntott          9.8      21.6
026.compress        48.8      69.2
099.go               1.0       2.8
124.m88ksim          2.1       3.4
126.gcc              0.3       0.4
130.li               5.2      10.1
132.ijpeg            1.2       1.9
134.perl             1.5       2.7
147.vortex           0.4       0.7
average              7.5      12.0

The data in Table 2 show that, on average, Demand_region uses about 7.5% of the memory required by
Phased_region for region formation for the benchmarks studied, over a range of roughly <1% to 49%. In the
worst case, Demand_region uses an average of 12% of the memory required by Phased_region, over a range
of about <1% to 69%. Benchmarks with larger numbers of procedures and procedure calls, and more and
longer call chains, benefited the most from Demand_region. While smaller benchmarks showed some benefit,
the smallest, 026.compress, showed the least benefit, suggesting that Demand_region may be best suited to
large applications.
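
The entries of Table 2 can be reproduced from Table 1 by dividing each Demand_region estimate by the Phased_region total. The Python sketch below checks this for three of the ten benchmarks; the helper name `percent_of_phased` is hypothetical, and rounding to one decimal place is an assumption about how the published values were formatted.

```python
# (Phased_region total, Demand_region average, Demand_region worst case),
# copied from Table 1 for three of the ten benchmarks.
table1 = {
    "008.espresso": (73997, 3186, 5175),
    "026.compress": (2601, 1270, 1800),
    "126.gcc": (1050754, 2666, 4391),
}

def percent_of_phased(demand_estimate, phased_total):
    """Demand_region estimate as a percentage of the Phased_region total."""
    return round(100.0 * demand_estimate / phased_total, 1)

for name, (phased, avg, worst) in table1.items():
    print(name, percent_of_phased(avg, phased), percent_of_phased(worst, phased))
# 008.espresso 4.3 7.0
# 026.compress 48.8 69.2
# 126.gcc 0.3 0.4
```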



Code growth 

Code growth was measured as the percentage change in overall code size from the original program, shown
in Table 3 as the percentage increase or decrease in size. To obtain the code sizes used to calculate code
growth, each benchmark was compiled in three ways: (1) without any inlining or region formation, (2) using
the Phased_region strategy, and (3) using the Demand_region strategy. Measurements were taken in terms of
Lcode instructions of the resulting compiled programs. An increase in code size is represented by a positive
value. For example, a value of 21 means that after compilation within a particular framework, the program
is 21% larger than the same program compiled using the procedure-based approach.

Table 3: Percentage change in code growth for Phased_region and Demand_region.

Benchmark       Phased_region  Demand_region  Demand_region - Phased_region
008.espresso               21             19                             -2
023.eqntott                24             26                             +2
026.compress               26             25                             -1
099.go                     22             25                             +3
124.m88ksim                21             20                             -1
126.gcc                    22             23                             +1
130.li                     20             23                             +3
132.ijpeg                  21             24                             +3
134.perl                   22             23                             +1
147.vortex                 21             21                              0
average                  22.0           22.9                           +0.9

Table 4: Percentage change in execution time for Phased_region and Demand_region compared to procedure-based.

Benchmark       Phased_region  Demand_region  Demand_region - Phased_region
008.espresso            -6.13          -1.12                           5.01
023.eqntott             -3.17          -2.14                           1.03
026.compress            -3.11          26.88                          29.99
099.go                  -6.28           7.30                          13.58
124.m88ksim             -4.65          -2.40                           2.25
126.gcc                 -6.72          -5.00                           1.72
130.li                  -8.49          12.50                          20.99
132.ijpeg               -7.01          -5.99                           1.02
134.perl                -4.22          -2.18                           2.04
147.vortex              -6.90          -3.72                           3.18
average                 -5.67           2.41                           8.08

On average, Demand_region introduces < 1% more code than Phased_region, over a range of 2% less to
3% more growth. In general, differences in code growth are not dramatic, due to the use of the global static
code growth limit of 20% in both Phased_region and Demand_region. In practice, the 20% code growth limit
prevents inlining once the code size has grown to 20% or more above the original size. However, a benchmark
may grow to just below this limit, allowing one more instance of inlining to be performed. Demand_region
shows slightly more code growth than Phased_region because Demand_region inlines in a different order,
which can lead to the benchmark first growing to just below the limit, and then inlining a larger procedure
which exceeds the limit considerably.
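
The effect of inlining order on the final code size can be seen in a toy model. The Python sketch below is a simplification under stated assumptions, not the Trimaran mechanism: `inline_under_limit` is a hypothetical helper, each inlining simply adds the callee's size to the program, and the limit is only checked before an inlining is attempted.

```python
def inline_under_limit(original_size, callee_sizes, limit=0.20):
    """Inline callees in the given order until growth reaches the limit.

    Returns the final code growth as a percentage of the original size.
    """
    size = original_size
    for callee in callee_sizes:
        if (size - original_size) / original_size >= limit:
            break  # limit reached: no further inlining is allowed
        size += callee  # the check happens before inlining, so this may overshoot
    return 100.0 * (size - original_size) / original_size

# Same three callees, two different inlining orders (sizes are made up).
print(inline_under_limit(1000, [100, 100, 60]))  # 20.0
print(inline_under_limit(1000, [100, 60, 100]))  # 26.0
```

In the second order the program sits at 16% growth, just below the limit, when the largest remaining callee is inlined, so the final growth overshoots the 20% limit.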

Runtime performance 

Table 4 reports the percentage change in execution time. Negative values for percentage change in execution
time indicate a performance speedup; the program ran faster compared to the procedure-based compilation.
The last column shows the difference in the change in execution time between Demand_region and
Phased_region, with a negative value indicating that a benchmark compiled using Demand_region ran faster than
when compiled with Phased_region; a positive difference indicates that Phased_region was faster.

For seven of the ten benchmarks, the results for execution time were quite similar for Phased-region 
and Demand-region, separated only by a few percentage points, which equates to fractions of a second in 
Table 5: Comparison of number of compilation units for procedure-based, Phased-region and Demand-region
(in Lcode instructions).

                            Number of Units                 Demand-region
Benchmark      Proc.-based  Phased-region  Demand-region    - Phased-region
008.espresso           361           1787           1774                -13
023.eqntott             62            436            476                 40
026.compress            16            117            102                -15
099.go                 383           1838           1888                 50
124.m88ksim            252           1336           1322                -13
126.gcc               1170           6084           6047                -37
130.li                 357            801            793                 -8
132.ijpeg              473           3575           3791                216
134.perl               316            822            797                -25
147.vortex            1127           5522           5616                 94
average                452           2232           2261                 29



wall clock time. In particular, there is little or no difference in performance for 008.espresso, 023.eqntott,
124.m88ksim, 126.gcc, 132.ijpeg, 134.perl and 147.vortex. The drop in performance from Phased-region to
Demand-region for 026.compress, 099.go and 130.li is due to naive heuristics for deciding whether to perform
demand-driven inlining at a given callsite, and the way the prototype system handles demand-driven inlining
of indirect recursive procedure calls. Specifically, with this implementation, it is possible for the code limit
to be reached before inlining is performed in some of the high execution frequency regions, resulting in
optimization loss. ILP processor utilization was also examined, with only insignificant variations noted.

Thus, while memory requirements are improved dramatically, runtime performance remains virtually un- 
affected in general. This improvement in memory requirements was the primary goal of performing demand- 
driven inlining during region formation in Demand-region. Since Demand-region is implemented using the 
same region formation and inlining heuristics, leading to substantially similar regions, dramatic improve- 
ments to runtime performance could not be reasonably expected. The key innovation of Demand-region is 
to integrate demand-driven inlining into region formation to reduce the requirements for memory during 
compilation. 

3.5 Analysis of Compilation Unit Characteristics 

Procedure restructuring affects the characteristics of the unit of compilation. Analyzing changes to program 
characteristics, such as the size, profile homogeneity and interprocedural scope of the unit of compilation, 
can further explain the impact on memory requirements, code growth and performance. 

Unit size 

Tables 5 and 6 report the total number of compilation units and average size in Lcode instructions for
each of the studied benchmarks under the three different strategies for compilation. The two region-based 
compilation techniques result in very similar average region size and total number of regions, while the 
procedure-based strategy produces far fewer, though far larger, compilation units. Slight variations in sizes 
and numbers of regions are attributed to differences in the order in which callsites are inlined. The aggressive 
inlining of Phased-region favors inlining frequently executed, smaller procedures over larger procedures due 
to the limit it places on total code growth and the inlining heuristic. Since the demand-driven inliner inlines 
as it is creating a region and reaches a callsite, it can reach the same specified limit for code growth at a 
different time due to different order of inlining. The demand-driven approach to inlining in Demand-region 
and the recursive nature of the algorithm lead to a bottom-up inlining of regions. That is, the inlining is 
performed as the recursive calls to FormRegions return. The contribution of the Demand-region approach is
that it can significantly reduce compilation memory requirements, while creating regions comparable in
number and size to those created by Phased-region.
Table 6: Comparison of average size of compilation units for procedure-based, Phased-region and
Demand-region (in Lcode instructions).

                            Average unit size               Demand-region
Benchmark      Proc.-based  Phased-region  Demand-region    - Phased-region
008.espresso           183             50             49                 -1
023.eqntott            169             33             31                 -2
026.compress           152             28             32                 +4
099.go                 234             55             51                 -4
124.m88ksim            155             31             30                 -1
126.gcc                530             62             57                 -5
130.li                  81             47             49                 +2
132.ijpeg              137             38             37                 -1
134.perl               206             51             49                 -2
147.vortex             161             35             34                 -1
average                201             43             42                 -1



Table 7: Comparison of percentage of invariant compilation units and profile variance (homogeneity) for
procedure-based, Phased-region and Demand-region.

               Proc.-based            Phased-region          Demand-region
               Profile   Pct. units   Profile   Pct. units   Profile   Pct. units
Benchmark      variance  invariant    variance  invariant    variance  invariant
008.espresso      0.362       88.7       0.340       81.2       0.342       94.1
023.eqntott       0.017       96.1       0.001       97.6       0.020       97.6
026.compress      0.313       90.7       0.245       87.5       0.375       90.8
099.go            0.292       87.1       0.293       88.2       0.293       91.0
124.m88ksim       0.272       92.1       0.249       92.9       0.292       93.3
126.gcc           0.132       89.0       0.108       90.1       0.108       93.2
130.li            0.198       90.3       0.208       88.5       0.203       94.2
132.ijpeg         0.273       88.8       0.254       86.7       0.310       91.1
134.perl          0.212       88.7       0.195       89.3       0.187       90.3
147.vortex        0.310       90.7       0.259       91.1       0.261       93.1
average           0.238       90.2       0.215       89.3       0.239       92.9



Profile homogeneity 

Profile homogeneity is defined as the measure of how similar the given unit of compilation is in terms of profile 
weight per instruction, operation or basic block. This variation on code density provides an indicator for the 
impact of region formation on optimization. More homogeneous compilation units enable the optimizer to 
easily identify and isolate heavily executed regions, and then selectively focus more attention on these more 
important regions and less attention elsewhere. This partitioning reduces the chance of leaving important 
portions of the code unoptimized or spending excessive time optimizing unimportant code. 

Within the context of units of compilation, the profile homogeneity, or profile variance, is defined to be 
the measure of the degree of deviation, that is, the standard deviation, in profile weights for all basic blocks 
within a compilation unit. Table 7 shows the average profile variance and percentage of compilation units
that are invariant for each benchmark. The average profile variance is an overall indication of how consistent 
the profile weights are within each of the benchmarks' units of compilation. The closer the profile variance is 
to 0, the less variation there is in the profile weights overall for the benchmark, and the more homogeneous 
the benchmark. 
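The two measures reported per benchmark can be sketched as follows. This is a minimal illustration under stated assumptions: profile weights are taken to be normalized per unit, the population standard deviation is used, and a unit is treated as invariant when its deviation is zero (within a tolerance). The text does not spell out these details, and the function names are hypothetical.

```python
from statistics import pstdev


def profile_variance(block_weights):
    """Standard deviation of the basic-block profile weights of one unit."""
    return pstdev(block_weights)


def summarize(units, tol=0.0):
    """Return (average profile variance, percent of invariant units).

    `units` is a list of compilation units, each given as a list of
    basic-block profile weights. A unit counts as invariant when its
    deviation does not exceed `tol` (an assumed definition).
    """
    variances = [profile_variance(weights) for weights in units]
    invariant = sum(1 for v in variances if v <= tol)
    return sum(variances) / len(variances), 100.0 * invariant / len(variances)


# Three made-up units: two with uniform weights, one mixing hot and cold blocks.
example_units = [[0.5, 0.5, 0.5], [0.0, 1.0], [1.0]]
avg_variance, pct_invariant = summarize(example_units)
```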

The results in Table 7 indicate that in every case Demand-region improves percentage of invariant units
over both procedure-based compilation and Phased-region. Phased-region tended to gain in some cases and
lose in others over procedure-based compilation. When there is an increase in the percentage of invariant 
code, there is generally also an increase in the profile variance of the code overall. This is due to the 
procedure restructuring done by region formation, which favors grouping more frequently executed code 
together, leaving less frequently executed code behind. Because less important code is not actively formed 
into more homogeneous regions, the profile weights of their containing regions are slightly more variant than 






Table 8: Comparison of percentage of interprocedural operations in Phased-region and Demand-region.

               Phased-region  Demand-region  Demand-region
Benchmark      % interproc.   % interproc.   - Phased-region
008.espresso            20.7           24.3              3.6
023.eqntott             18.0           23.8              5.8
026.compress            23.5           25.9              2.4
099.go                  28.4           26.7             -1.7
124.m88ksim             22.9           24.9              2.0
126.gcc                 19.3           25.4              6.1
130.li                  30.2           28.0             -2.2
132.ijpeg               23.0           25.1              2.1
134.perl                20.0           24.9              4.9
147.vortex              23.0           25.7              2.7
average                 22.9           25.5              2.6



regions of frequently executed code. It can be hypothesized that Demand-region produces less variant code 
over Phased-region because the integrated, demand-driven use of inlining within Demand-region uses the 
region formation desirability heuristic (50% or greater execution frequency) to also guide inlining. Overall, 
the frequency of code inlined by Demand-region is likely to be greater than the more general, aggressive 
inlining approach in Phased-region. 



Interprocedural scope 

When specifically comparing regions to procedures, and regions formed using different techniques, a change 
in the number of interprocedural regions and the amount of interprocedural operations per region indicates the 
change in interprocedural scope. Recall that an interprocedural region is a region that includes instructions 
from more than one procedure. Interprocedural operations are the instructions in an interprocedural region 
that are from procedures other than the procedure in which formation of the region began. Before inlining is 
performed, all basic blocks are annotated with the block's procedure of origin. With this origin information, 
the impact of region formation on interprocedural scope in a unit of compilation can be measured directly by 
calculating how much of the code within each unit originated outside itself. The percentage of interprocedural 
code in the program is measured as a simple ratio of the number of interprocedural operations to total 
operations. 
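The measurement just described can be sketched as a simple count over origin annotations. The data layout below (a region as a pair of its starting procedure and a list of per-operation origins) is an assumption made for illustration, not the compiler's actual representation.

```python
def interprocedural_percentage(regions):
    """Percent of operations that originate outside their region's root procedure.

    `regions` is a list of (root_procedure, origins) pairs, where `origins`
    holds the procedure of origin for each operation in the region and
    `root_procedure` is the procedure in which region formation began.
    """
    total = 0
    interproc = 0
    for root, origins in regions:
        total += len(origins)
        interproc += sum(1 for origin in origins if origin != root)
    return 100.0 * interproc / total if total else 0.0


# Made-up example: one region pulls in two operations inlined from `foo`.
example_regions = [
    ("main", ["main", "main", "foo", "foo"]),
    ("bar", ["bar", "bar", "bar", "bar"]),
]
pct = interprocedural_percentage(example_regions)  # 2 of 8 operations
```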

Table 8 shows the average percent of code within regions that is from a procedure outside the region (i.e.,
interprocedural code). An improvement in the percentage is indicative of better interprocedural scope. An
increase in interprocedural scope within the unit of compilation means that the potential for interprocedural
optimization is increased without additional analysis.

In general, the percentage of interprocedural operations is similar for Phased-region and Demand-region.
The differences in interprocedural scope under the two techniques are slight. The interaction of various
factors leads to insight on how to improve the techniques. The slight increase in interprocedural scope
for 008.espresso occurs with a slight decrease in code growth and little or no change to profile variance.
For 023.eqntott, slight differences in code growth and variance would not indicate the larger increase in
interprocedural scope seen for Demand-region. This change could be due to the slight reduction seen for
average size of the unit of compilation for Demand-region for 023.eqntott, since the other factors were quite
similar. The improvements to interprocedural scope seen with 126.gcc, 132.ijpeg, 134.perl and 147.vortex are
likely due to slight decreases in the average size of the unit of compilation, which are magnified due to the
large sizes of the benchmarks. Most puzzling is the increase in interprocedural scope seen with Demand-region
applied to 026.compress, 099.go, 130.li, and to a lesser extent, 124.m88ksim, which exhibit significantly more
variance and a definite reduction in runtime performance versus Phased-region. This seeming contradiction
for these four benchmarks could be due to the effect of gaining interprocedural scope by restructuring, with
the side-effect of leaving behind more invariant code in the process. An increase in code growth appears to
be the cause of decreased interprocedural scope for 130.li.
Figure 7: Relationship of first- and second-order heuristics to region formation and the demand-driven inliner.

4 Heuristics for Demand-driven Inlining 

In the previous section, baseline heuristics were used to guide demand-driven inlining within region
formation to establish the efficacy of the Demand-region approach as compared with Phased-region and traditional
procedure-based compilation. This section explores a variety of heuristics designed to improve the
performance of Demand-region, and discusses a number of classifications, factors and important issues that are
integral to inlining heuristics design.

Because region formation drives inlining, the heuristics for a demand-driven inliner must consider the 
order that procedures are processed by the region formation phase and the characteristics of a callee at each 
callsite as it is encountered during region formation \IV2\ . Procedures that are analyzed later in the compilation 
may result in less inlined code within them and thus be less optimized since code growth restrictions could 
limit further inlining, and thus limit the interprocedural scope of that procedure. Therefore, procedures that 
have the highest potential for optimization, particularly instruction scheduling for ILP architectures, should 
be processed first by the region formation analysis phase. Thus, demand-driven inlining within a region-based
compiler involves two general classes of heuristics: first-order heuristics, which determine the order
in which procedures are processed during region formation, and second-order heuristics, which govern decisions
about whether to inline at each callsite. Figure 7 illustrates the location within a region-based compilation
framework of these two classes of heuristics.

4.1 First-order Heuristics 

First-order heuristics select the order to consider procedures for region formation, which will implicitly affect
the order of demand-driven inlining decisions. Because demand-driven inlining within a given procedure 
is considered at callsites as region formation is performed for that procedure, the order of decisions for 
demand-driven inlining follows the flow-directed manner in which region formation is performed within a 
given procedure's control flow graph. 

The first-order heuristics studied in the research attempt to order procedures from most to least important 
in terms of optimization opportunity. In particular, three possible first-order heuristics for demand-driven 
inlining were examined. The most precise measurement of procedure importance is actual dynamic run-time 
profiling which comes at the cost of an initial instrumentation, compilation and execution. Procedures are 
ordered from highest to lowest percentage of overall run-time spent in the procedure, based on profiling 
information. It is worth noting that procedures which consume larger portions of execution time are likely to 
contain loops and callsites within the loops, which supports the importance of this heuristic to interprocedural 
region formation in a demand-driven framework. 

Static estimates of importance provide less costly heuristics, but also tend to produce less precise informa- 
tion. One heuristic based on static estimates orders procedures from most to least number of static callsites 
within the procedure, and within that order from smallest to largest procedure size. More importance is
assigned to procedures with the highest percentage of callsites compared with code size. This increases the 
chance that region formation will be performed interprocedurally, producing more scheduling and optimiza- 
tion opportunities, while controlling code growth by considering smaller procedures before larger ones. 
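The static callsite ordering described above amounts to a two-key sort. The sketch below is illustrative; the `Proc` record and the example procedures are hypothetical.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    name: str
    callsites: int  # number of static callsites in the procedure
    size: int       # procedure size, e.g. in Lcode instructions


def order_by_static_callsites(procs):
    """Most callsites first; ties broken by smallest procedure size."""
    return sorted(procs, key=lambda p: (-p.callsites, p.size))


example = [Proc("a", 2, 300), Proc("b", 5, 400), Proc("c", 5, 120), Proc("d", 0, 50)]
ordered_names = [p.name for p in order_by_static_callsites(example)]
# `c` and `b` tie on callsites, so the smaller `c` is processed first.
```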

Another ordering considered is based on the loop call weight of a procedure, assigning more importance
to procedures which contain more callsites within loops, and increased importance for those callsites that
are more deeply nested. The loop call weight is computed as $\sum_{i=1}^{n} \mathit{loopdepth}_i \times W$,
where $n$ is the number of callsites in a procedure, $\mathit{loopdepth}_i$ is the loop nesting depth of
callsite $i$, and $W$ is the loop depth weight constant. A value of 10 is used for $W$ to assign an
order of magnitude increase in significance to successive loop depths, since, intuitively, interior loops consume
more execution cycles than do their enclosing loops.
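A minimal sketch of the loop call weight follows, reading the formula as the sum over callsites of loop depth times W. It assumes that a callsite outside any loop has depth 0; the function names and example data are hypothetical.

```python
W = 10  # loop depth weight constant used in the text


def loop_call_weight(callsite_loop_depths, w=W):
    """Sum of (loop nesting depth * w) over all callsites of a procedure."""
    return sum(depth * w for depth in callsite_loop_depths)


def order_by_loop_call_weight(procs):
    """Order (name, callsite_depths, size) tuples by decreasing loop call
    weight, then by ascending procedure size."""
    return sorted(procs, key=lambda p: (-loop_call_weight(p[1]), p[2]))


single = loop_call_weight([1])        # one callsite inside one loop
nested = loop_call_weight([0, 1, 2])  # depths 0, 1 and 2
ranked = order_by_loop_call_weight([("p", [0], 50), ("q", [1, 2], 400), ("r", [3], 90)])
```

Here `q` and `r` have the same weight (30), so the smaller `r` is placed first, and `p`, whose only callsite is outside any loop, comes last.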



4.2 Second-order Heuristics 

Second-order heuristics involve the decision about whether to inline each callee within a procedure as region 
formation reaches that callsite. While there are a number of heuristics already developed for this decision 
making, they have all been applied within a separate inlining phase without consideration of the interac- 
tions with region formation analysis, and in particular, demand-driven inlining. The second-order heuristics 
attempt to increase instruction scheduling and optimization opportunities while minimizing code growth. 
For correctness, a callee is not inlined when the number or types of parameters at the callsite do not match
those of the callee, or when the compiler determines that memory regions associated with arguments to the
procedure may overlap or that the arguments are pointers.

To avoid high code growth, inlining is prevented once the overall code size has increased more than
20% above the original size. A code growth limit of 20% has been shown to minimize unnecessary
code growth while still allowing beneficial inlining [20]. Similarly, inlining is prevented for procedures that
are directly or indirectly recursive to avoid the potential for excessive code growth.

Procedures that are more frequently executed than a fixed frequency or with some desired ratio over 
the frequency of the caller are inlined. Region formation already uses this second-order heuristic, such that 
inlined procedures will always be executed at least 50% as frequently as the seed block of their enclosing 
region. Only procedures that are less than a static maximum size are inlined to limit code growth, and 
procedures with higher call overhead compared with their code size are inlined. 
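The second-order checks described in this subsection can be combined into a single predicate, sketched below. This is an illustration under stated assumptions rather than the compiler's implementation: the `Callsite` fields, the `max_callee_size` parameter and the order of the tests are hypothetical, and the call-overhead criterion is omitted because the text does not define how it is measured.

```python
from dataclasses import dataclass


@dataclass
class Callsite:
    callee_size: int           # callee size in instructions
    callee_freq: float         # profiled execution frequency of the callee
    seed_freq: float           # frequency of the enclosing region's seed block
    callee_is_recursive: bool  # directly or indirectly recursive
    args_compatible: bool      # parameters match and no aliasing concerns


def should_inline(cs, current_size, original_size, max_callee_size,
                  growth_limit=0.20, min_freq_ratio=0.50):
    """Return True if the callee at this callsite passes every check."""
    if not cs.args_compatible:       # correctness check
        return False
    if cs.callee_is_recursive:       # avoid runaway growth from recursion
        return False
    if (current_size - original_size) / original_size >= growth_limit:
        return False                 # global code growth limit reached
    if cs.callee_size >= max_callee_size:
        return False                 # static size threshold
    # Callee must run at least 50% as often as the region's seed block.
    return cs.callee_freq >= min_freq_ratio * cs.seed_freq


hot = Callsite(callee_size=40, callee_freq=80.0, seed_freq=100.0,
               callee_is_recursive=False, args_compatible=True)
```

For example, `should_inline(hot, 1100, 1000, max_callee_size=100)` accepts the callsite at 10% growth, while the same call with a current size of 1200 rejects it because the 20% limit has been reached.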

4.3 Empirical Evaluation 

Experiments were conducted to study the effectiveness of a number of heuristic combinations (Table 9), and
to determine which strategies can improve characteristics of the program and its runtime performance. The
heuristic combinations were compared by measuring three effects in terms of the percentage change of each
combination versus H0, the baseline method. In particular, the effects that were evaluated include: (1) code
growth, (2) compilation time, including the time to compile the source code up through region formation
and region-based optimization, and (3) execution time, which measures more directly the impact of inlining
heuristics on region formation and region-based optimization, and ultimately on runtime performance. Note
that the results in this section cannot be compared directly to those in Section 3 due to slight variations in
implementation needed to incorporate the newer inlining heuristics.

When designing inlining heuristics, first-order heuristics should not ignore the goals of second-order
heuristics. In particular, first-order heuristics should anticipate second-order heuristics by processing procedures
early in the compilation that will benefit most from the interprocedural scope gained from demand-driven
inlining. Second-order heuristics should rely on first-order heuristics to provide more important procedures
earlier in the compilation, while constraining code growth so that procedures remaining to be handled by
region formation can still benefit from demand-driven inlining. While the heuristics are the same for
Phased-region in Section 3 and H1 in this section, for example, the implementation of the heuristics was modified
to enable consistent comparison of results with the newer heuristics. Experimental results reported in
Section 3 enable the initial valid comparison of Demand-region with the original, unmodified Phased-region
framework.
Table 9: Summary of heuristic combinations.

H0
  First-order:  None
  Second-order: None
  Intuition/Motivation: Baseline version of original region-based compilation. No inlining is performed.

H1
  First-order:  Run-time profile ordering.
  Second-order: Inlined procedures guaranteed to be executed at least 50% as frequently as seed block
                in their region [20].
  Intuition/Motivation: Original region-based compilation. Aggressive inlining with standard code growth
                limit, then region formation; first- and second-order inlining heuristics as defined by [20].

H2
  First-order:  Order by descending number of static callsites, then ascending procedure size.
  Second-order: Same as H1, plus only inline if callee size < 25 IT^.
  Intuition/Motivation: Demand-driven inlining with simple static heuristics; avoid more costly analysis
                in order to potentially improve compilation time.

H3
  First-order:  Same as H2.
  Second-order: Same as H1, plus prevent inlining direct or indirect recursion.
  Intuition/Motivation: Demand-driven inlining with simple static heuristics. Increase number of
                procedures into which inlining is performed before code growth limit is reached,
                preventing successive inlining of recursive procedures.

H4
  First-order:  Order by decreasing loop call weight, then ascending procedure size.
  Second-order: Same as H3.
  Intuition/Motivation: Static estimation of profile information by equating loop characteristics with
                predicted execution frequency, for improved compilation time.

H5
  First-order:  Order by decreasing execution cycles, then ascending procedure size.
  Second-order: Same as H3.
  Intuition/Motivation: Actual runtime profile information should provide most precise information for
                guiding region formation and demand-driven inlining, for improved runtime performance.

H6
  First-order:  Same as H5.
  Second-order: Same as H3, plus minimum loop call weight of 10 to inline. (Note: a procedure containing
                a single loop is assigned a loop call weight of 10.)
  Intuition/Motivation: Only inline if contains at least 1 call within at least one loop. Combines profile
                information to prioritize compilation of procedures, with potentially faster static loop
                characteristic estimation for making inlining decisions; should improve compilation time.



4.3.1 Methodology 

Implementation of the described techniques and experiments has been conducted in the context of the Trimaran
compiler [28]. The existing region formation module was enhanced to incorporate additional first-order
inlining heuristics. The demand-driven inliner within Demand-region was extended with a number of new
second-order inlining heuristics. In addition, the demand-driven inliner was more tightly integrated into the
compiler, enabling a meaningful measurement of compilation time. The experiments were performed on the
same set of benchmarks (Tabled P-EJ- 



4.3.2 Results 

Code growth 

Table 10 reports the percentage increase in code growth for heuristics H1 through H6 versus the baseline
compilation H0. Code growth was measured directly by counting the number of Lcode instructions resulting
from compilation using each of the heuristics.

Heuristic H2 does a significantly better job than any of the other methods at limiting code growth.
This is not surprising, since it uses a simple, static threshold that only allows inlining of small procedures. In
general, heuristic H1, or Phased-region in its basic form, does a little better in most cases than the remaining
heuristics based on demand-driven inlining. The changes in code growth are generally slight, and in nearly all
cases remained under the static limit of 20% used in earlier experiments, indicating these heuristics provide
improved code growth control. Code growth for 026.compress exceeded 20% for H3, H4, H5 and H6. This is
due to an order of region formation, and therefore inlining, that causes a very large procedure to be inlined
when the code growth limit had already nearly been reached.
Table 10: Percentage change in code growth over H0.

Benchmark        H1    H2    H3    H4    H5    H6
008.espresso      8     1    20    20    17    17
023.eqntott       6          18    18    15    15
026.compress     17          23    23    21    21
099.go            9     3    12    11     8     8
124.m88ksim       8     1    16    15    12    12
126.gcc          10     2    15    15     8     8
130.li            8           8     6     4     4
132.ijpeg        14     3    17    17    15    15
134.perl         11     1    18    17    12    12
147.vortex       15     3    16    16    13    13
average          11     1    16    16    13    13



Table 11: Percentage change in compilation time over H0.

Benchmark         H1     H2     H3     H4     H5     H6
008.espresso     2.1    0.0    2.0    4.2    4.9    4.5
023.eqntott      1.8   -0.1    2.4    4.8    9.3    7.9
026.compress    -8.3   -1.3   -2.8    0.0    5.6    2.8
099.go           3.0   -2.5    6.5    7.8   10.0    9.3
124.m88ksim      4.0   -2.1   27.6   18.4   27.6   27.4
126.gcc          2.9   -0.3   21.1   15.2   15.4   15.2
130.li           4.5    0.0   26.8   24.5   25.6   24.9
132.ijpeg        3.5   -1.0   13.9   13.5   14.0   13.8
134.perl         2.8   -1.4   14.8   14.2   15.3   15.1
147.vortex       3.7    0.1    4.8    7.8    3.0    3.0
average          2.0   -0.9   11.7   11.0   13.1   12.4



Compilation time 

Results for compilation time for the heuristics are presented in Table 11. The change in compilation time as
compared with H0 is shown as a percentage increase (positive) or decrease (negative). Compilation time
for each heuristic was measured by timing the compilation through the optimized Lcode phase, just prior
to the point when Trimaran outputs instrumented code for simulated execution on the target architecture.
This timing includes any applicable phases for profiling, intermediate code generation, aggressive inlining,
region formation (which may or may not include demand-driven inlining), and region-based optimization.
The timings used were system times (i.e., wall clock times) accurate to the nearest tenth of a second, and
were on the order of minutes or hours (not unusual for a research compiler).

In general, compilation time is lowest for H2, which uses the simplest inlining heuristic, followed by
H1, the Phased-region compilation method. Due to the overhead introduced in the current implementation
of demand-driven inlining, unusually high increases in compilation time were seen in most other cases where
demand-driven inlining is used (H3 through H6).

It is interesting to note that for some of the benchmarks (008.espresso, 023.eqntott, 026.compress, and
147.vortex), compilation time increased only slightly over H1 and H2 for the remaining heuristics. This
indicates that other more complex factors may be helping to control compilation time in spite of the more
advanced and time-consuming inlining heuristics being used.

Runtime performance 

Relative changes in performance between the baseline heuristic H0 and the others are shown in Table 12.
Performance was measured by running the programs using the Trimaran simulator, which involved an addi- 
Table 12: Percentage change in execution time over H0.

Benchmark          H1      H2      H3      H4      H5      H6
008.espresso    -4.01    0.50   -5.50   -7.35    1.75    1.75
023.eqntott     -3.17    1.31   -4.02   -6.11   -1.90   -1.90
026.compress    -3.11    0.00   -3.11   -3.11   -2.98   -2.98
099.go          -3.19    0.07    2.30   -4.20   -4.10   -4.10
124.m88ksim     -6.13   -1.31   -3.90   -9.22   -9.13   -9.13
126.gcc         -5.20   -1.03   -4.15   -0.24  -10.20  -10.20
130.li          -8.49   -2.16   -4.01  -12.53  -12.20  -12.20
132.ijpeg       -2.98    0.12   -4.77   -7.33   -6.97   -6.97
134.perl        -5.50   -1.90   -4.79  -10.43   -9.54   -9.54
147.vortex      -3.05   -1.71   -5.10   -9.25   -9.21   -9.21
average         -4.48   -0.61   -3.71   -6.98   -6.45   -6.45



tional phase of compilation to instrument the Lcode output from region formation to execute in the simulated 
EPIC environment described earlier. 

H3 was competitive with H1, but H4, H5 and H6 all showed general improvements in performance
over H1. Overall, the best performance speedup was consistently demonstrated with heuristic H4, which
uses the static loop call weight estimator and recursion prevention to guide inlining decisions. There was also
little or no significant change in processor utilization (i.e., CPI) for most of the benchmarks under most of
the heuristics.
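The comparison can be checked directly from the execution-time table. The sketch below transcribes the H1, H4 and H5 columns of Table 12 (percentage change in execution time over H0; more negative is faster) and recomputes the averages and the number of benchmarks on which a heuristic is fastest among these three. The helper names are illustrative.

```python
EXEC_TIME_CHANGE = {
    "008.espresso": {"H1": -4.01, "H4": -7.35, "H5": 1.75},
    "023.eqntott":  {"H1": -3.17, "H4": -6.11, "H5": -1.90},
    "026.compress": {"H1": -3.11, "H4": -3.11, "H5": -2.98},
    "099.go":       {"H1": -3.19, "H4": -4.20, "H5": -4.10},
    "124.m88ksim":  {"H1": -6.13, "H4": -9.22, "H5": -9.13},
    "126.gcc":      {"H1": -5.20, "H4": -0.24, "H5": -10.20},
    "130.li":       {"H1": -8.49, "H4": -12.53, "H5": -12.20},
    "132.ijpeg":    {"H1": -2.98, "H4": -7.33, "H5": -6.97},
    "134.perl":     {"H1": -5.50, "H4": -10.43, "H5": -9.54},
    "147.vortex":   {"H1": -3.05, "H4": -9.25, "H5": -9.21},
}


def average(heuristic):
    """Mean percentage change across all benchmarks for one heuristic."""
    rows = EXEC_TIME_CHANGE.values()
    return sum(row[heuristic] for row in rows) / len(EXEC_TIME_CHANGE)


def wins(heuristic):
    """Number of benchmarks where the heuristic is fastest (ties count)."""
    return sum(1 for row in EXEC_TIME_CHANGE.values()
               if row[heuristic] == min(row.values()))


h4_average = round(average("H4"), 2)
h4_wins = wins("H4")
```

On these columns, H4 is fastest or tied for fastest on every benchmark except 126.gcc, where H5 is markedly better.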

4.3.3 Discussion 

The code growth, compilation time and runtime performance for the benchmarks under different inlining
heuristics interact in a number of ways. In most cases where code growth increases, execution time also
improves. The more naive heuristics of H2 lead to the smallest increases in code size and compilation time,
but also do not improve performance as much as the other more sophisticated methods. Larger increases in
code growth and compilation time do not always translate to improvements in execution time, indicating
that bounding code growth is indeed important, as was believed. For example, when H3 was applied to
099.go, code size and compilation time increased more than with H2 while execution took longer, possibly
due to recursive inlining of less important code.

The more scientific codes (124.m88ksim, 132.ijpeg, 147.vortex) tend to benefit the most from increases
in code growth and compilation time (which is also optimization and scheduling time) in terms of their
speedup, particularly for the most advanced profile-estimating (H4) and profile-based (H5 and H6) methods.
Smaller benchmarks (026.compress, 023.eqntott), by both size and number of procedures, are less predictable,
although significant performance gains are seen with H3 through H6, with most showing improvement over
the original region-based technique (H1). Benchmarks with more recursion (026.compress, 099.go, 130.li)
require more compilation time and gain comparatively less in performance improvements than the others.

The combination of heuristics in H4 proved consistently to be the most effective at controlling code growth 
and compilation time while improving runtime performance. The fact that H4 bases inlining decisions on the 
static loop call weight, which estimates runtime behavior, rather than the actual profiling information itself, 
as in H5 and H6, is significant, indicating that profiling may not be necessary for making good demand- 
driven inlining decisions during region formation. Profiling generally is more precise than static estimates 
because it directly measures program behavior at runtime, but requires more overhead and depends on the 
data used for the profiling. 

4.4 Impact on Compilation Unit Characteristics 

The experimental study in Section 3 examined how runtime performance can be improved by increasing
interprocedural scope of compilation units, and reducing the profile variance of each unit. To test this hy- 
pothesis further, the characteristics of the compiled Lcode were measured after region formation for heuristics 
Table 13: Comparison of number and average size of compilation units for three heuristics.

               H0                  H1                  H4
               Avg     Units       Avg     Units       Avg     Units
Benchmark      size    (proc.)     size    (reg.)      size    (reg.)
008.espresso    183       361      15.9      4267      15.0      4710
023.eqntott     169        62      16.6       656      16.2       755
026.compress    152        16      16.8       206      16.4       218
099.go          234       383      17.0      5424      16.9      5921
124.m88ksim     155       252      18.6      3494      17.7      4027
126.gcc         530      1170      15.3     41613      14.9     43971
130.li           81       357      14.9      2427      15.6      2516
132.ijpeg       137       473      14.1      4720      13.8      5109
134.perl        206       316      15.2      4395      14.9      4891
147.vortex      161      1127      16.9     11026      16.2     11713
average         201       452      16.1      7823      15.8      8383



H0 (the baseline, with no inlining or region formation), H1 (Phased-region, for comparison) and H4 (the
overall best performing heuristic).

Unit size 

Table 13 compares the compilation unit size characteristics resulting from the three heuristics, H0, H1,
and H4. Both H1 and H4 show significant improvement in control of the size of the unit of compilation,
with average region sizes ranging from 3% to 19% of the original average procedure sizes. H4 consistently
produces more compilation units than H1, which reflects the comparative code growth measurements for
the two heuristics. The average size of the unit of compilation decreases slightly from H1 to H4, by 0.1 to 0.9
Lcode instructions. Such a slight decrease in the average size of compilation units cannot directly
account for the longer compilation times seen with H4. The more significant factor is code growth,
which results from the increase in the number of compilation units that accompanies the decrease in average
size; with more code to compile, compilation time naturally increases.
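
The unit-size figures above can be checked with a few lines of arithmetic. The sketch below recomputes the region-to-procedure size ratios from the values printed in Table 13. It also approximates total code size as average unit size times number of units, which is an assumption made here for illustration and not a measurement reported in the paper. The names `TABLE_13`, `size_ratio_pct` and `approx_total` are hypothetical.

```python
# Values as printed in Table 13:
# benchmark -> ((H0 avg size, H0 units), (H1 avg size, H1 units), (H4 avg size, H4 units))
TABLE_13 = {
    "008.espresso": ((183, 361), (15.9, 4267), (15.0, 4710)),
    "023.eqntott": ((169, 62), (16.6, 656), (16.2, 755)),
    "026.compress": ((152, 16), (16.8, 206), (16.4, 218)),
    "099.go": ((234, 383), (17.0, 5424), (16.9, 5921)),
    "124.m88ksim": ((155, 252), (18.6, 3494), (17.7, 4027)),
    "126.gcc": ((530, 1170), (15.3, 41613), (14.9, 43971)),
    "130.li": ((81, 357), (14.9, 2427), (15.6, 2516)),
    "132.ijpeg": ((137, 473), (14.1, 4720), (13.8, 5109)),
    "134.perl": ((206, 316), (15.2, 4395), (14.9, 4891)),
    "147.vortex": ((161, 1127), (16.9, 11026), (16.2, 11713)),
}


def size_ratio_pct(region_avg, proc_avg):
    """Average region size as a percentage of the average procedure size."""
    return 100.0 * region_avg / proc_avg


def approx_total(avg_size, units):
    """Rough total code size (assumption: average unit size times unit count)."""
    return avg_size * units


# Ratios for both region-based heuristics (H1 and H4) against the H0 baseline.
ratios = []
for h0, h1, h4 in TABLE_13.values():
    ratios.append(size_ratio_pct(h1[0], h0[0]))
    ratios.append(size_ratio_pct(h4[0], h0[0]))

# Does the approximate total under H4 exceed that under H1 for every benchmark?
h4_total_exceeds_h1 = all(
    approx_total(*h4) > approx_total(*h1) for _, h1, h4 in TABLE_13.values()
)

print(f"region/procedure size ratio: {min(ratios):.1f}% to {max(ratios):.1f}%")
print("H4 approximate total exceeds H1 for every benchmark:", h4_total_exceeds_h1)
```

Under this approximation, the smallest and largest ratios round to the 3% and 19% quoted above, and the estimated total for H4 is larger than for H1 on every benchmark, which is consistent with attributing the longer compilation times to code growth.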

Profile homogeneity 

Table 14 shows the profile homogeneity and percentage of invariant code for these three heuristics. In most
cases, H4 improved upon the amount of invariant code versus H1, while showing slight to moderate increases
in the profile variance. The consistent increase in variance, together with the attending increase in the
percentage of invariant compilation units, indicates that H4, as compared with H1, is simultaneously improving
the profile homogeneity of more compilation units while increasing the variance of a smaller number of
compilation units by relocating the more variant code. The benefit seen in the percentage of invariant
compilation units reflects the improvement in runtime performance for H4 over H1.
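
As a small worked check of the comparison above, the sketch below tallies, per benchmark, whether H4 raised the profile variance and the percentage of invariant units relative to H1. The numbers are those printed in Table 14. The names `TABLE_14`, `variance_up` and `invariant_up` are hypothetical and chosen only for this illustration.

```python
# Values as printed in Table 14:
# benchmark -> ((H1 profile variance, H1 pct. invariant), (H4 profile variance, H4 pct. invariant))
TABLE_14 = {
    "008.espresso": ((0.288, 93.7), (0.361, 94.0)),
    "023.eqntott": ((0.001, 97.8), (0.022, 98.6)),
    "026.compress": ((0.210, 93.7), (0.275, 93.8)),
    "099.go": ((0.310, 91.9), (0.331, 93.1)),
    "124.m88ksim": ((0.260, 94.4), (0.311, 94.9)),
    "126.gcc": ((0.262, 93.4), (0.300, 93.9)),
    "130.li": ((0.208, 93.7), (0.203, 94.2)),
    "132.ijpeg": ((0.201, 93.8), (0.239, 94.2)),
    "134.perl": ((0.189, 94.5), (0.249, 95.2)),
    "147.vortex": ((0.211, 93.8), (0.271, 94.1)),
}

# Benchmarks where H4 has higher profile variance than H1.
variance_up = [name for name, (h1, h4) in TABLE_14.items() if h4[0] > h1[0]]

# Benchmarks where H4 has a higher percentage of invariant units than H1.
invariant_up = [name for name, (h1, h4) in TABLE_14.items() if h4[1] > h1[1]]

print(f"variance higher under H4: {len(variance_up)} of {len(TABLE_14)} benchmarks")
print(f"pct. invariant higher under H4: {len(invariant_up)} of {len(TABLE_14)} benchmarks")
```

With the values as printed, the percentage of invariant units rises for every benchmark, and the variance rises for all but 130.li.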

Interprocedural scope 

Table 14 also compares the change in interprocedural scope for H1 and H4 compared to the baseline heuristic
H0, which had 0% interprocedural code since no inlining was performed. Interprocedural scope improved
in all cases when using the demand-driven heuristics of H4, which showed improvements of 1.2% to 8.7%
over H1, as well. Improvements for H4 as compared with H1 were less significant for 008.espresso, 099.go,
and 134.perl, which have more instances of direct recursion than 130.li, which exhibits significant indirect
recursion. Indirect recursion within region formation leads to increased interprocedural regions as procedures
are inlined into other procedures. Direct recursion, or self-recursion, leads only to the inlining of a procedure
into itself, if at all. The smaller size and lower number of procedures in 023.eqntott and 026.compress led to
the larger improvements to interprocedural scope due to a higher proportion of smaller procedures.
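
The distinction between direct and indirect recursion drawn above can be illustrated on a toy call graph. This is a minimal sketch and not the paper's region formation algorithm. The adjacency-list representation, the example graph, and the names `reaches` and `classify_recursion` are all assumptions made for illustration.

```python
def reaches(graph, start, target):
    """Return True if `target` can be reached from `start` by following call edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def classify_recursion(graph):
    """Split procedures into directly recursive and indirectly recursive sets.

    Direct recursion: a procedure calls itself (a self-edge).
    Indirect recursion: a procedure lies on a call cycle that passes
    through at least one other procedure.
    """
    direct = {p for p, callees in graph.items() if p in callees}
    indirect = {
        p
        for p, callees in graph.items()
        if any(q != p and reaches(graph, q, p) for q in callees)
    }
    return direct, indirect


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "main": ["eval", "fact"],
    "eval": ["apply"],
    "apply": ["eval", "lookup"],
    "lookup": [],
    "fact": ["fact"],
}

direct, indirect = classify_recursion(call_graph)
print("direct:", sorted(direct))
print("indirect:", sorted(indirect))
```

In this example, inlining along the `eval`/`apply` cycle places code from one procedure inside another, which widens interprocedural scope, whereas `fact` can only be inlined into itself.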
Table 14: Comparison of percentage of invariant compilation units and profile variance (homogeneity) for
three heuristics, and resulting interprocedural scope.

| Benchmark | H0 profile variance | H0 pct. units invariant | H1 profile variance | H1 pct. units invariant | H4 profile variance | H4 pct. units invariant | H1 pct. interproc. ops. | H4 pct. interproc. ops. |
|---|---|---|---|---|---|---|---|---|
| 008.espresso | 0.368 | 93.1 | 0.288 | 93.7 | 0.361 | 94.0 | 18.0 | 19.2 |
| 023.eqntott | 0.020 | 97.7 | 0.001 | 97.8 | 0.022 | 98.6 | 15.1 | 23.8 |
| 026.compress | 0.313 | 92.1 | 0.210 | 93.7 | 0.275 | 93.8 | 21.7 | 25.9 |
| 099.go | 0.372 | 91.5 | 0.310 | 91.9 | 0.331 | 93.1 | 23.1 | 24.9 |
| 124.m88ksim | 0.308 | 93.9 | 0.260 | 94.4 | 0.311 | 94.9 | 17.1 | 19.8 |
| 126.gcc | 0.323 | 92.9 | 0.262 | 93.4 | 0.300 | 93.9 | 18.2 | 20.4 |
| 130.li | 0.208 | 93.7 | 0.208 | 93.7 | 0.203 | 94.2 | 24.4 | 27.6 |
| 132.ijpeg | 0.281 | 93.2 | 0.201 | 93.8 | 0.239 | 94.2 | 17.3 | 19.3 |
| 134.perl | 0.270 | 94.3 | 0.189 | 94.5 | 0.249 | 95.2 | 16.9 | 18.1 |
| 147.vortex | 0.280 | 93.1 | 0.211 | 93.8 | 0.271 | 94.1 | 19.0 | 20.5 |
| average | 0.274 | 93.6 | 0.214 | 94.1 | 0.256 | 94.6 | 19.9 | 24.1 |



5 Related Work 

Region-based compilation remains an active area of research, with promising applications to a Java
virtual machine [S] 131) including a variety of adaptive techniques [5], and ILP optimization and scheduling
frameworks [261 1351 1^ . Region formation is a form of interprocedural data flow analysis, a well-researched
area with many benefits to ILP |17[ I19L 1^ . Its disadvantages are that, during analysis, it can have unscalable
memory requirements [16] or require exponential time with respect to program size [16]. Advances address
the issue of unscalable memory and time requirements by using modular [25, 30] and demand-driven [16]
approaches, while profile-driven analysis and optimization PJ Q 1101 1111 I12j are vital to code improvement
and performance.

Procedure inlining is used to eliminate call overhead |S1 |Sj leading to fewer and faster calls ^ , improve
compiler analysis and optimization 01IH1, register usage, code locality and execution speed [5], provide more
precise data flow information to generate more efficient code specialized to the callee T,"^, and enable
intraprocedural analysis and optimizations such as constant propagation and elimination of redundant operations
to be applied at interprocedural scope |H1E|. However, inlining can increase register pressure P 151 [T3],
code size "WW , instruction cache misses @I|H1|S], and compilation time, which is more critical during
dynamic compilation 6 . Extensive research into inlining heuristics and the factors that bolster or limit their
effectiveness within procedure-oriented compilers has been performed E El 1 1-^1 1151 1211 1231 1241 EEI.

6 Conclusions and Future Work 

Region-based compilation has already been shown to help increase ILP performance by enabling
interprocedural code motion without the expense of large compilation units or interprocedural data flow
analysis. This research has focused on improving the effectiveness of region-based compilation that integrates
heuristics-guided inlining into region formation analysis. Experimental results comparing two region-based
approaches demonstrated that a demand-driven approach to inlining, as compared to a phased aggressive
inlining approach, can reduce memory requirements and code growth while improving runtime performance
due to increased profile homogeneity and interprocedural scope. These improvements are further enhanced
by making more informed inlining decisions and by reordering the processing of procedures by a region-based
compiler, leading to further improvements to compilation unit characteristics, reflected as improved
performance. Heuristics based on static analysis can be as effective as profile-based heuristics at guiding
inlining decisions.

Partial inlining is an inlining technique that selectively inlines portions of a callee procedure into a call site
rather than the entire body of the callee [ni ll4ini^l27| . Region-based compilation naturally enables a form of
partial inlining for the optimization phase of compilation [20]. The approach presented in this paper is being
extended to the design of an algorithm for incorporating partial inlining into region-based compilation [31],
including its applicability to object-oriented programming.